
Multinode evals #1000

Open

Oseltamivir wants to merge 29 commits into main from multinode_eval

Conversation

@Oseltamivir
Collaborator

Summary

Add eval-only support for multi-node benchmarks and wire those eval results into CI collection + summary reporting.

This covers:

  • eval matrix selection for multi-node configs
  • eval-only workflow jobs for multi-node sweeps
  • AMD MI355X eval execution in server.sh
  • NVIDIA Slurm eval execution through Oseltamivir's srt-slurm fork
  • eval artifact upload, score validation, and multi-node-aware summary tables

How evals are run

Single-node evals are selected on 8k1k at the maximum and median concurrency for each (model, runner, framework, precision, spec-decoding, dp-attn) group.

Multi-node evals are selected on 8k1k by taking the entry with the highest max concurrency for each (model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn) group, then running eval at the
median concurrency from that config via eval-conc.
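The multi-node selection rule above can be sketched as follows. This is a minimal illustration, not the actual config-generation code: the entry fields and the `select_multinode_evals` helper are hypothetical stand-ins.

```python
from statistics import median

def select_multinode_evals(entries):
    """Sketch: per group, pick the 8k1k entry with the highest max
    concurrency, then eval at that entry's median concurrency."""
    groups = {}
    for e in entries:
        # Hypothetical group key mirroring the fields named in the PR text.
        key = (e["model"], e["runner"], e["framework"], e["precision"],
               e["spec_decoding"], e["prefill_dp_attn"], e["decode_dp_attn"])
        groups.setdefault(key, []).append(e)

    selected = []
    for group in groups.values():
        # Entry with the highest max concurrency wins the group.
        best = max(group, key=lambda e: max(e["concurrencies"]))
        # Eval runs at the median concurrency from that winning config.
        selected.append({**best, "eval_conc": int(median(best["concurrencies"]))})
    return selected
```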

EVAL_ONLY=true starts the server with expanded eval context, skips throughput benchmarking, runs lm-eval,
writes meta_env.json + results*.json + sample*.jsonl, uploads those artifacts, then validates scores
against thresholds.
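The final score-validation step can be sketched like this. `validate_scores` and the threshold format are assumptions for illustration, not the actual validation code:

```python
def validate_scores(results, thresholds):
    """Sketch of the post-eval score gate: compare each task's metric
    against its configured minimum and collect any failures."""
    failures = []
    for task, min_score in thresholds.items():
        score = results.get(task)
        # A missing task counts as a failure, same as a below-threshold score.
        if score is None or score < min_score:
            failures.append((task, score, min_score))
    return failures
```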

srt-slurm fork delta vs upstream

NVIDIA multinode eval uses Oseltamivir/srt-slurm@sa-submission-q1-2026 instead of ishandhanani/srt-slurm.

Compared with current upstream/main, that fork adds the eval path InferenceX needs:

  • a new lm-eval benchmark runner
  • /infmax-workspace mounting via INFMAX_WORKSPACE
  • EVAL_ONLY support in do_sweep.py to skip benchmark stage and run post-eval directly
  • full wait_for_model() health checking before eval in eval-only mode
  • pass-through of framework/model/topology/env metadata into the eval container
  • MODEL_NAME=self.config.served_model_name so eval queries the served alias, not the HF repo id
  • pass-through of the workflow's EVAL_CONC into EVAL_CONCURRENT_REQUESTS
  • copying eval outputs into /logs/eval_results/ for launcher-side artifact pickup
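The eval-only health gate named above can be sketched roughly as follows. `wait_for_model()` is mentioned in the fork delta, but its real signature and the `is_ready` callable here are assumptions:

```python
import time

def wait_for_model(is_ready, timeout_s=1800, poll_s=10):
    """Sketch of the eval-only health gate: block until the server
    reports ready or the timeout expires. `is_ready` is a hypothetical
    callable wrapping the real health endpoint."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    # Server never came up; the eval stage should abort rather than run.
    return False
```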

Validation

| Run | Scope | Result |
| --- | --- | --- |
| 23888824506 | B300 eval-only | 8/8 jobs passed, GSM8K 94.69-96.13 |
| 23802423939 | GB300 eval-only | 12/12 jobs passed, GSM8K 89.92-95.98 |
| 23882945894 | B200 multinode eval-only | 17/18 jobs passed; 1 dynamo-sglang FP8 MTP job hit server health timeout |
| 23909140268 | H200 multinode eval-only | 4/8 jobs passed; dynamo-trt passed, dynamo-sglang jobs failed before Slurm log creation, issue raised |
| 23800447228 | MI355X multinode eval-only | no-DPA rows passed GSM8K 95.68-96.66; DPA=true rows failed at 0.0, issue raised |

Oseltamivir and others added 29 commits February 19, 2026 16:12
The sglang 0.5.8 Docker image ships a newer lm-eval 0.4.9.2 commit
that defaults fewshot_as_multiturn=True for chat-completion models.
Since the version string matches the pinned commit, pip silently
skips the install. Adding --force-reinstall ensures the pinned
commit is always used regardless of what's pre-installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds dsr1-fp8-mi355x-sglang-disagg-nodpa-eval: same image/model/precision
as the DPA config but with dp-attn=false and ep=1. Running evals on this
will tell us if DPA is the cause of the 0% GSM8K score or if it's
something else about the fp8 disagg setup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Oseltamivir Oseltamivir requested a review from a team April 3, 2026 04:01
@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.


@functionstackx
Contributor

@Oseltamivir can you upstream your changes to srt-slurm?

Contributor

@claude claude bot left a comment


This is a large infrastructure PR (22 files) touching multi-node CI/CD workflows, all NVIDIA runner scripts, AMD server logic, and Python result collection — including a switch from the ishandhanani/srt-slurm fork to Oseltamivir/srt-slurm across all NVIDIA runners, which warrants a human look.


Overview

The PR adds eval-only support for multi-node benchmarks, touching GitHub Actions workflows (benchmark-multinode-tmpl.yml, e2e-tests.yml, run-sweep.yml), all six NVIDIA Slurm runner scripts, the AMD MI355X server.sh/job.slurm/submit.sh, shared benchmark_lib.sh, and Python utilities for config generation and result collection.

Security risks

The most notable concern is the switch from ishandhanani/srt-slurm to Oseltamivir/srt-slurm@sa-submission-q1-2026 across all NVIDIA multi-node runners. This changes the external code being cloned and executed on production cluster runners. While the PR description enumerates the fork's delta vs upstream, a human should validate the trust decision of pinning to this fork at this branch.

Level of scrutiny

High scrutiny is warranted. This PR touches production CI/CD infrastructure across multiple hardware platforms, introduces a new external dependency fork, and the PR description itself documents known partial failures (H200 dynamo-sglang jobs failing, MI355X DPA=true rows failing at 0.0). These open issues suggest the eval path is not fully stable across all targets yet.

Other factors

The no-bugs finding from the automated system is reassuring for logic correctness, but the scope (22 files, new workflow job types, eval artifact pipeline, split summary tables) and the documented known failures make this a PR that should have at least one human reviewer before merge.

